Abstract

We present a novel approach to semi- 
supervised learning which is based on statis- 
tical physics. Most of the former work in the 
field of semi-supervised learning classifies the 
points by minimizing a certain energy func- 
tion, which corresponds to a minimal k-way 
cut solution. In contrast to these methods, 
we estimate the distribution of classifications, 
instead of the sole minimal k-way cut, which 
yields more accurate and robust results. Our 
approach may be applied to all energy functions
used for semi-supervised learning. The method is based on
sampling using a Multicanonical Markov chain Monte-Carlo
algorithm, and has a straightforward probabilistic
interpretation, which allows for soft assignments of points
to classes and can also cope with yet unseen class types.
The suggested approach is demonstrated on a toy data set
and on two real-life data sets of gene expression.



1. Introduction



Situations in which many unlabelled points are avail- 
able and only few labelled points are provided call for 
semi-supervised learning methods. The goal of semi- 
supervised learning is to classify the unlabelled points, 
on the basis of their distribution and the provided la- 
belled points. Such problems occur in many fields, in 
which obtaining data is cheap but labelling is expen- 
sive. In such scenarios supervised methods are imprac- 
tical, but the presence of the few labelled points can 
significantly improve the performance of unsupervised 
methods. 

The basic assumption of unsupervised learning, i.e.
clustering, is that points which belong to the same 
cluster actually originate from the same class. Cluster- 
ing methods which are based on estimating the density 
of data points define a cluster as a 'mode' in the dis- 
tribution, i.e. a relatively dense region surrounded by 
relatively lower density. Hence each mode is assumed 
to originate from a single class, although a certain class 
may be dispersed over several modes. 

In case the modes are well separated they can be eas- 
ily identified by unsupervised techniques, and there is 
no need for semi-supervised methods. However, con- 
sider the case of two close modes which belong to two 
different classes, but the density of points between 
them is not significantly lower than the density within 
each mode. In this case density based unsupervised 
methods may encounter difficulties in distinguishing
between the modes (classes), while semi-supervised 
methods can be of help. Even if a few points are la- 
belled in each class, semi-supervised algorithms, which 
cannot cluster together points of different labels, are 
forced to place a border between the modes. Most 
probably the border will pass in between the modes, 
where the density of points is lower. Hence, the forced 
border 'amplifies' the otherwise less noticed differences 
between the modes. 

For example, consider the image in Fig. 1a. Each
pixel corresponds to a data point and the similarity
score between adjacent pixels is of value unity. The
green and red pixels are labelled while the remaining,
blue pixels are unlabelled. The desired classification
into red and green classes appears in Fig. 1b. It is
unlikely that any unsupervised method would partition
the data correctly (see e.g. Fig. 1c) since the
two classes form one uniform cluster. However,
semi-supervised methods, which use a few labelled points
and must place a border between the red and green classes,
may become useful.

In recent years various types of semi-supervised learning
algorithms have been proposed; however, almost
all of these methods share a common basic approach.
They define a certain cost function, i.e. energy, over
the possible classifications, try to minimize this energy, 
and output the minimal energy classification as their 
solution. Different methods vary by the specific energy 
function and by their minimization procedures; for ex- 
ample the work on graph cuts E], minimizes the 
cost of a cut in the graph, while others choose to min- 
imize the normalized cut cost jllj!20j, or a quadratic 

cost naEi]. 

As stated recently by Blum et al., searching for a minimal energy
has a basic disadvantage, common to all former
methods: it ignores the robustness of the found solu- 
tion. Blum et al. mention the case of several minima 
with equal energy, where one arbitrarily chooses one 
solution, instead of considering them all. Put differ- 
ently, imagine the energy landscape in the space of 
solutions; it may contain many equal energy minima 
as considered by Blum et al., but also other phenomena
may harm the robustness of the global minimum as 
an optimal solution. First, it may happen that the 
difference in energy between the global minimum and
nearby solutions is minuscule, thus picking the minimum
as the sole solution may be incorrect or arbitrary.
Secondly, in many cases there are too few data points 
(both labelled and unlabelled) which may cause the 
empirical density to locally deviate from the true den- 
sity. Such fluctuations in the density may drive the 
minimal energy solution far from the correct one. For 
example, due to fluctuations a low density "crack" may 
be formed inside a high density region, which may er- 
roneously split a single cluster in two. Another type of 
fluctuation may generate a "filament" of high density 
points in a low density region, which may unite two 
clusters of different classes. In both cases, the minimal 
energy solution is erroneously 'guided' by the fluctu- 
ations, and fails to find the correct classification. An 
example of the latter case appears in Fig. 1a; the
classifications provided by three semi-supervised methods,
shown in Fig. 1d-f, fail to recover the desired
classification, due to a 'filament' which connects the classes.

Searching for the minimal energy solution is equiva- 
lent to seeking the most probable joint classification 
(MAP). A possible remedy to the difficulties in this 
approach may then be to consider the probability dis- 
tribution of all possible classifications. Blum et al. 
provided a first step in this direction using a random- 
ized min-cut algorithm. In this work we provide a 
different solution based on statistical physics. 

Basically each solution in our method is weighted
according to its energy E(classification), by what is known
as the Boltzmann weight, and its probability is given by:

\[ \Pr(\text{classification}; T) \propto e^{-E(\text{classification})/T}, \tag{1} \]

where the "temperature" T serves as a free parameter,
and the energy E takes into account both unlabelled
and labelled points. Classification is then performed
by marginalizing (1), thus estimating the probability
that a point i belongs to a class c. This formalism
is often referred to as a Markov random field (MRF),
which has been applied in numerous works, including
in the context of semi-supervised learning by [22].
However, they seek the MAP solution (which corresponds
to T = 0), while we estimate the distribution
itself (at T > 0).

Using the framework of statistical physics has several
advantages in the context of semi-supervised learning.
First, classification has a simple probabilistic
interpretation. It yields a fuzzy assignment of points to class
types, which may also serve as a confidence level in the
classification. Secondly, since exactly estimating the
marginal probabilities is, in most cases, intractable,
statistical physics has developed elegant Markov chain
Monte-Carlo (MCMC) methods which are suitable for
estimating semi-supervised systems. Due to the inherent
complexity of semi-supervised problems, 'standard'
MCMC methods, such as the Metropolis [HI
and Swendsen-Wang [17] methods, provide poor results,
and one needs to apply more sophisticated algorithms,
as discussed in Section 3. Thirdly, using statistical
physics allows us to gain an intuition regarding
the nature of a semi-supervised problem, i.e., it allows
for a detailed analysis of the effect of adding labelled
points to an unlabelled data set. In addition, our
method also has two practical advantages: (i) while
most semi-supervised learning methods consider only
the case of two class types, our method is naturally
extended to the multi-class scenario; (ii) our method is
able to suggest the existence of a new class type, which
did not appear in the labelled set.

Our main objective in this paper is to present a framework,
which can later be applied in different directions.
For example, the energy function in (1) can be any of
the functions used in other semi-supervised methods.
In this paper we chose to use the min-cut cost function.
We do not claim that using this cost function
is optimal, and indeed we observed that it is suboptimal
in some cases. However, we aim to convince the
reader that applying our method, to any energy function,
would always yield equal or better results than
merely minimizing the same energy function.

Figure 1. a. The unlabelled data in blue; labelled points are marked in green and in red. Each of the 1360 pixels corresponds
to a data point; the labelled pixels were enlarged for clarity. b. The correct classification. c. Clustering results of the
unsupervised normalized cut algorithm [15]. d. The min-cut solution. e. The results of the semi-supervised consistency
method of [21]. f. The outcome of the spectral graph transducer algorithm [11], which is a semi-supervised extension of
the normalized cut algorithm.



Our work is closely related to the typical cut criterion 
for unsupervised learning, first introduced by ^ in the 
framework of statistical physics and later in a graph 
theoretic context by [5]. The method introduced in
this work can be viewed as an extension of these clus- 
tering algorithms to the semi-supervised case. 

The paper is organized as follows: Section 2 presents
the model, and Section 3 discusses the issue of estimating
marginal probabilities. Section 4 presents the
qualitative effect of adding labelled points. Our
semi-supervised algorithm is outlined in Section 5. Section 6
demonstrates the performance of our algorithm on a
toy data set and on two real-life examples of gene
expression data.

2. Model definition 

In our model each data point i, i = 1, ..., N, corresponds
to a random variable, or spin, $s_i$ which can
take one of $q \ge 2$ discrete states. The number of states
q matches the number of class types in the labelled
set. A certain classification of the data set then corresponds
to a vector S, $S = \{s_1, \ldots, s_N\}$. Assume that
the first $M \ll N$ points are labelled, i.e., the state
of spin $s_k$, $1 \le k \le M$, is clamped to a spin value $c_k$,
which corresponds to the class type of point k. Hence
the energy E in our case is simply the Potts model energy
of a granular ferromagnet with an external field:

\[ E(S) = \sum_{\langle i,j \rangle} J_{ij} \left(1 - \delta_{s_i,s_j}\right) + \sum_{k=1}^{M} h_k \left(1 - \delta_{s_k,c_k}\right), \tag{2} \]

where $J_{ij} > 0$ is a predefined similarity between points
i and j, $\langle i,j \rangle$ stands for all edges of the neighborhood graph,
and $\delta_{s_i,s_j} = 1$ when $s_i = s_j$ and zero otherwise. The
second term, which corresponds to the labelled points,
is known as the 'external field' term. In case the value
of $s_k$ is different from the point's assigned class $c_k$, the
energy is increased by a value $h_k$. We used $h_k = \infty$,
which assigns non-zero probability only to classifications
in which $s_k = c_k$. Notice that one can introduce
uncertain labels by using finite values of $h_k$, but we do
not consider this case in this work.

The major problem in applying the suggested method
concerns the difficulty in calculating (1). Since the
number of possible classifications is exponential in N,
one often needs to apply MCMC sampling algorithms,
which are considered in the next section.
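
To make the exponential cost concrete, the following minimal sketch computes the marginals of (1) exactly by enumerating all classifications of a tiny graph under the energy (2) with $h_k = \infty$. The five-point chain, its similarities and the function names are illustrative assumptions, not part of the experiments reported here; the enumeration visits $q^N$ classifications and is therefore only feasible for very small N.

```python
import itertools
import math


def potts_energy(states, edges, labels):
    """Energy (2) with h_k = infinity: the cut cost, or inf if a label is violated."""
    for k, c in labels.items():
        if states[k] != c:
            return math.inf
    return sum(J for (i, j), J in edges.items() if states[i] != states[j])


def exact_marginals(n, q, edges, labels, T):
    """Enumerate all q**n classifications and marginalize the Boltzmann weights."""
    marg = [[0.0] * q for _ in range(n)]
    Z = 0.0
    for states in itertools.product(range(q), repeat=n):
        E = potts_energy(states, edges, labels)
        if math.isinf(E):
            continue  # zero probability: a clamped label is violated
        w = math.exp(-E / T)
        Z += w
        for i, s in enumerate(states):
            marg[i][s] += w
    return [[m / Z for m in row] for row in marg]


# Hypothetical toy problem: a chain of 5 points with one weak link (2-3);
# point 0 is labelled as class 0 and point 4 as class 1.
edges = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 0.2, (3, 4): 1.0}
labels = {0: 0, 4: 1}
p = exact_marginals(5, 2, edges, labels, T=0.5)
for i, row in enumerate(p):
    print(i, [round(x, 3) for x in row])
```

In this toy chain the soft assignments place the border at the weak link: points 1 and 2 lean towards class 0 and point 3 towards class 1, with probabilities that express the confidence of each assignment.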

3. Estimating marginal probabilities 

Introducing labelled points inherently changes the 
properties of the system and poses great difficulties 
in MCMC sampling. Labelled points may introduce 
'frustration' into the system (a term borrowed from 
statistical physics); if, for example, point i is connected 
to a couple of differently labelled points j and k, it is
'frustrated' since whenever it matches one of them it 
contradicts the other. Such frustration appears also 
in physical systems of spin glasses, and is known to 
complicate their analysis. 

The difficulty in sampling from spin glass systems results
from their rugged energy landscape. The energy
landscape can be described as being composed of several
'valleys' which are surrounded by very high energy
barriers, which the sampling method is unable
to traverse at low temperatures. As a result, 'standard'
MCMC methods, e.g. the Metropolis and the
Swendsen-Wang methods, are confined to a certain
'valley' for an exponential number of Markov chain
steps, thus their estimates may be highly biased.

'Extended MCMC methods' is the name given to a family
of methods which enable efficient sampling in complex 
scenarios such as spin-glasses ^]. Extended MCMC 
methods solve the sampling problem by allowing the 
system to 'jump' between 'valleys'. This is implicitly 
performed by letting the system pass through high en- 
ergy configurations, which most likely erase any mem- 
ory of the originating 'valley'. In this work we applied 
the Multicanonical Monte-Carlo method "H", which is 
a member of the extended MCMC methods. 

The Multicanonical Monte-Carlo method first estimates
the density of states D(E), i.e. the number
of different classifications at a given energy. It then
generates a sample of classifications, {S}, drawn from
the distribution $\Pr_D(S) \propto 1/D(E(S))$, which can then
be used to recover the Boltzmann distribution (1), for
all temperatures T at once. Sampling from $\Pr_D(S)$
yields a uniform distribution over all energy levels,
which forces the MCMC to pass through high energy
configurations and by that overcome the energy barriers.
For further details about the method we refer the
reader to [10].
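
The reweighting step can be illustrated with a short sketch. For simplicity it obtains D(E) by exact enumeration of a hypothetical five-point chain rather than by multicanonical sampling, so it shows only how a known density of states yields Boltzmann averages at any T, not the sampler itself. All names (`density_of_states`, `mean_energy`, `reweighted_mean`) and the toy graph are assumptions made for illustration.

```python
import itertools
import math
from collections import Counter


def cut_energy(states, edges):
    """First term of (2): total similarity of edges joining different classes."""
    return round(sum(J for (i, j), J in edges.items() if states[i] != states[j]), 9)


def allowed_states(n, q, labels):
    """All classifications that respect the clamped labels (h_k = infinity)."""
    for states in itertools.product(range(q), repeat=n):
        if all(states[k] == c for k, c in labels.items()):
            yield states


def density_of_states(n, q, edges, labels):
    """D(E): the number of allowed classifications at each energy level."""
    return Counter(cut_energy(s, edges) for s in allowed_states(n, q, labels))


def mean_energy(D, T):
    """Boltzmann average of E at temperature T, computed from D(E) alone."""
    Z = sum(d * math.exp(-E / T) for E, d in D.items())
    return sum(d * E * math.exp(-E / T) for E, d in D.items()) / Z


def reweighted_mean(samples, D, T):
    """Boltzmann average of an observable from (energy, value) samples that
    were drawn with probability proportional to 1/D(energy)."""
    num = den = 0.0
    for E, value in samples:
        w = D[E] * math.exp(-E / T)  # undo the 1/D bias, apply the Boltzmann weight
        num += w * value
        den += w
    return num / den


edges = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 0.2, (3, 4): 1.0}
labels = {0: 0, 4: 1}
D = density_of_states(5, 2, edges, labels)
for T in (0.2, 0.5, 2.0):
    print(T, round(mean_energy(D, T), 4))
```

The same D(E) serves every temperature, which is why a single multicanonical run suffices to scan the whole range of T.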

Before presenting the effect of labelled points, we
would like to briefly discuss an alternative to MCMC
sampling, which is to approximate the marginal probabilities
using methods from the field of graphical
models. The intimate connections between statistical
physics and graphical models have been demonstrated,
e.g. by [19]. Our Boltzmann distribution corresponds
to an undirected graphical model, thus estimating the
marginal probabilities is equivalent to performing inference
in this model. Since exact inference, via the
junction tree algorithm [14], is generally intractable,
one needs to resort to approximate methods, such as
(loopy) belief propagation (BP) or generalized belief
propagation (GBP) [19]. In our experimental study
the performance of BP was rather poor, probably since
the graphical model of a typical semi-supervised problem
contains many short loops. On the other hand
the performance of GBP was excellent, when applied
to two dimensional problems, as in Fig. 1. However, to
date there exists no principled way of applying GBP to
a general graph, which guarantees good approximate
inference, therefore we consider only MCMC sampling.

4. The effect of labelled points 

As explained in Section 3, the labelled points inherently
complicate the sampling of the system. On the other 
hand, adding labelled points has the desired effect on 
classification. In order to understand this phenomenon 
we first describe the unsupervised case and then qual- 
itatively explain the effects of adding labelled points. 

The properties of a system governed by the Boltzmann
distribution (1) change with the temperature
T. In many physical systems, the temperature range
T > 0 can be divided into intervals, or phases, each
of which has its own global properties. Granular
ferromagnets without external fields (i.e. keeping only the
first term of (2)), which correspond to the unsupervised
case, are known to have three phases ^H]: a
low-temperature phase in which the system is ferromagnetic,
i.e. most of the spins are assigned the same
value; a high temperature phase in which the system is
paramagnetic, i.e. the values assigned to the spins are
nearly independent; and an intermediate phase termed
the super-paramagnetic (SP) phase. In this phase,
which is the most relevant for clustering, all spins of a
grain (i.e. a cluster) are assigned a certain value, with
different values at different grains. The clusters in the
data can be identified in this SP phase; the larger the
temperature interval of this phase, the more significant
and stable is the clustering solution [12].

Adding labelled points changes the system's behavior. 
First, it effectively increases the strength of the inter- 
action between spins near labelled points, which can 
be interpreted as an increase of their local density. As 
a result there is an increase in the transition tempera- 
ture between the ordered SP phase and the unordered 
paramagnetic phase, thus extending the temperature
interval of the SP phase at its upper limit.

A second effect happens at low temperatures. For 
example, consider the case of two dense grains, each 
containing a labelled point of a different type, which 
are separated by a lower density region. In the SP 
phase the spins in each of the grains attain their cor- 
rect class, but the spins in the low density region are 
still unordered. As the temperature is lowered the two 
classes 'penetrate' into the low density region until a 
'border' between the classes is formed. Hence, from a 
semi-supervised perspective, the labelled points cause 
the low density region to be classified. Notice that at 
this temperature the unsupervised case is already at 
the ferromagnetic phase, where the two clusters are 
united. Hence, the labelled points also decrease the 
lower limit of the SP phase, which together with in- 
creasing its upper limit, results in a larger temperature 
interval relevant for classification. 

When the temperature is further lowered, a different 
classification may appear. For example, one of the 
class types may overtake the whole system, similar to 
the min-cut solution in Fig. 1d, but, of course, we are
not interested in such a solution.

Fig. 2 presents the effect of adding labelled points in
the case of Fig. 1a. We plot the number of misclassified
points using the algorithm in Sec. 5 as a function
of T, in the unsupervised (US) and in the
semi-supervised (SS) cases. In this data set we calculated
(1) exactly using the junction tree algorithm (exact
US and exact SS), and compared it to Multicanonical
sampling (MC). Notice that adding labelled points
decreases the number of errors dramatically, achieving
almost correct classification over a large temperature
interval (0.5 < T < 1). At lower temperatures, which
correspond to the min-cut solution (Fig. 1d), the number
of misclassified points is large.

5. The algorithm

Our semi-supervised learning algorithm consists
of two parts, an estimation part and a classification
part, which are described below.

Estimation consists of three stages:

• Map each point i to a q-state random variable $s_i$,
where q is the number of class types of the labelled points.

• Construct a graph G of neighboring points i and j,
and assign each edge its pairwise similarity $J_{ij} > 0$.

• Estimate the marginal probabilities $p_i(s_i; T)$ and
$p_{ij}(s_i, s_j; T)$, as explained in Sec. 3.

Classification of a point i at a temperature T can
simply be performed by $\arg\max_{1 \le k \le q} p_i(s_i = k; T)$.
However, we suggest a heuristic method which is
slightly more elaborate, but takes into account the
confidence in the classification, and also allows us to identify
new class types. This heuristic consists of two
steps:

• Classify 'confident' points, using the single point
probabilities $p_i$: For each point i we find the two most
probable class assignments and their probabilities, $p_1$
and $p_2$. In case $p_1 - p_2 > \tau$, where $\tau > 0$ is a user
defined confidence parameter, we classify point i
according to the type which corresponds to $p_1$.

• Classify the remaining, less 'confident' points, using
the pairwise probabilities $p_{ij}$: Following the intuition
of 0] we estimate the pairwise correlations between i
and j defined as

\[ C_{ij}(T) = \sum_{k=1}^{q} p_{ij}(s_i = k, s_j = k; T), \]

where the correlation ranges from the random level
1/q, to a perfect correlation value of 1. We then delete
edges from the graph G for which $C_{ij}(T) < \frac{1}{2}\left(1 + \frac{1}{q}\right)$,
i.e., halfway between random level and perfect correlation,
and find the connected components of the resulting
graph. Each 'unconfident' point j is then classified
according to the connected component to which it belongs.
In case j belongs to a connected component
which contains points (already) classified as $c_k$, then
j is assigned to $c_k$. If j belongs to a connected component
which contains points which were assigned to
several different classes, it remains unassigned and is
marked as "confused" between these classes. Finally,
all the points which belong to a connected component
that does not contain any classified point are marked
as a new class.
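
The two classification steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in our experiments: the marginals are assumed to be given as plain nested lists (`p_single[i][k]` for $p_i(s_i = k; T)$ and `p_pair[(i, j)][a][b]` for $p_{ij}(s_i = a, s_j = b; T)$), the function name `classify` and the output encoding of "confused" and "new" points are hypothetical, and connected components are found with a simple union-find.

```python
def classify(p_single, p_pair, edges, q, tau=0.1):
    """Two-step heuristic: confident points first, then connected components."""
    n = len(p_single)

    # Step 1: a point is 'confident' if its two largest marginals differ by > tau.
    confident = [None] * n
    for i, p in enumerate(p_single):
        ranked = sorted(range(q), key=lambda k: p[k], reverse=True)
        if p[ranked[0]] - p[ranked[1]] > tau:
            confident[i] = ranked[0]

    # Step 2: keep only edges whose correlation C_ij reaches (1 + 1/q) / 2,
    # and merge their endpoints into connected components (union-find).
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    threshold = 0.5 * (1.0 + 1.0 / q)
    for (i, j) in edges:
        c_ij = sum(p_pair[(i, j)][k][k] for k in range(q))
        if c_ij >= threshold:
            parent[find(i)] = find(j)

    # Collect the classes of the confident points in each component.
    classes = {}
    for i in range(n):
        if confident[i] is not None:
            classes.setdefault(find(i), set()).add(confident[i])

    result = []
    for i in range(n):
        if confident[i] is not None:
            result.append(confident[i])
            continue
        found = classes.get(find(i), set())
        if len(found) == 1:
            result.append(next(iter(found)))
        elif found:
            result.append(("confused", tuple(sorted(found))))
        else:
            result.append(("new", find(i)))  # one new class per component
    return result


# Hypothetical marginals for a 5-point chain with q = 2.
strong = [[0.475, 0.025], [0.025, 0.475]]  # C_ij = 0.95
weak = [[0.3, 0.2], [0.2, 0.3]]            # C_ij = 0.60 < 0.75, edge deleted
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
p_pair = {(0, 1): strong, (1, 2): weak, (2, 3): strong, (3, 4): weak}
p_single = [[0.9, 0.1], [0.52, 0.48], [0.5, 0.5], [0.5, 0.5], [0.1, 0.9]]
print(classify(p_single, p_pair, edges, q=2))
```

In this example point 1 inherits class 0 from its confident neighbour, while points 2 and 3 form a component without any classified point and are therefore reported as a new class.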

Notice that the classification depends on T. The rationale
is to supply the user with a classification 'profile' of
each data point, over all temperatures. Since statistically
significant classifications span large temperature
intervals, such a 'profile' is rather limited in size. For
example, the 'profile' of a point j which resides in class
$c_1$, close to the border with class $c_2$, would contain two
classifications: at low temperatures j is assigned to $c_1$,
and at higher temperatures it is marked as "confused"
between $c_1$ and $c_2$.

As for the value of $\tau$, our experimental study has
shown that classification performance decreases with
increasing the value of $\tau$ (data not shown), thus we
chose to use $\tau = 0.1$.

In case there are no labelled points, $p_i(s_i = k) = 1/q$
for all i and k; then all points are treated as 'unconfident',
and our 'classification' simply coincides with the clus- 
tering procedure of ^. 

6. Experimental results 

We present results over three data sets: A toy data 
set, and two real-life data sets of gene expression. 

6.1. Toy data 

Fig. 3 presents a toy data set similar to Fig. 1 which
contains 1306 data points from three classes. As in
the former toy data, the similarity between adjacent
pixels is of unit value, hence the three classes form one
connected cluster, which cannot be separated without
the labelled points.



Figure 3. A toy data set comprised of three classes. Each pixel corresponds to a
data point, and $J_{ij} = 1$ for all adjacent pixels i and j. The labelled points are
randomly sampled with a uniform distribution. In order to enable correct
classification the labelled points from the lower two classes are sampled from the
area marked by a rectangle.

In order to evaluate the performance of our approach
we carried out two sets of experiments. In the first set we
randomly chose M¹ labelled points (M = 5, 10, 15 and
20) from the two lower (green and red) classes, i.e. q =
2, while in the second set of experiments the labelled
points were randomly chosen from all three classes,
i.e. q = 3. For each value of M and q we evaluated 100
instances (realizations) of random labelling, and the
number of misclassified points appears in Fig. 4. The
number of misclassified points in the unsupervised case
was 1001, and was evaluated as explained in Sec. 5.

As expected, incorporating even a few labelled points
has a significant impact on the number of misclassified
points. As can be seen, the results highly depend on
the specific instance of labelled points, hence the average
performance is less informative. Therefore the
instances are presented in increasing order of the number
of points misclassified by our approach (MC), while the other two
lines correspond to the graph-cuts method² (GC) and
to the local-global consistency method [21] (LGC). In
order to plot the MC line in Fig. 4 we automatically
selected a temperature, $T^*$, at which the classification is
significantly different from the ground state (T = 0)
solution, and is also most 'stable'. At each T we consider
only the points, c(T), whose classification is both confident
and different from the T = 0 solution. We define
a score $\eta(T) = |c(T)| \cdot s(T)$, where s(T) is the average
temperature interval in which the classifications
of c(T) remain unchanged. Then, $T^* = \arg\max_T \eta(T)$,
and in case $\eta(T^*) < \eta_0$, we set $T^* = 0$. In general,
we recommend using the 'profiles' of the points, since
there may be several 'stable' solutions at different
temperatures.
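
One possible reading of this selection rule is sketched below. It is only an illustration under explicit assumptions: classifications are assumed to be available on a finite, increasing grid of temperatures whose first entry is T = 0, unconfident points are encoded as `None`, and the interval of a point is measured as the extent of the contiguous grid range over which its label is unchanged. The function name `select_temperature` and the parameter `eta0` are hypothetical.

```python
def select_temperature(temps, labels_by_T, eta0=0.0):
    """Return (T*, eta(T*)) following the score eta(T) = |c(T)| * s(T).

    temps          -- increasing temperatures, temps[0] == 0 (ground state)
    labels_by_T[t] -- list of class labels at temps[t]; None if not confident
    """
    ground = labels_by_T[0]
    best_T, best_eta = temps[0], 0.0
    for t in range(1, len(temps)):
        current = labels_by_T[t]
        # c(T): confident points whose class differs from the T = 0 solution.
        changed = [i for i, lab in enumerate(current)
                   if lab is not None and lab != ground[i]]
        if not changed:
            continue
        # s(T): average extent of the temperature range with unchanged label.
        spans = []
        for i in changed:
            lo = hi = t
            while lo > 0 and labels_by_T[lo - 1][i] == current[i]:
                lo -= 1
            while hi + 1 < len(temps) and labels_by_T[hi + 1][i] == current[i]:
                hi += 1
            spans.append(temps[hi] - temps[lo])
        eta = len(changed) * (sum(spans) / len(spans))
        if eta > best_eta:
            best_T, best_eta = temps[t], eta
    if best_eta < eta0:
        return temps[0], best_eta  # no sufficiently stable solution: fall back to T = 0
    return best_T, best_eta


# Hypothetical profiles of four points on a grid of five temperatures.
temps = [0.0, 0.25, 0.5, 0.75, 1.0]
labels_by_T = [
    [0, 0, 0, 0],           # T = 0: one class overtakes the whole system
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, None, None, None],  # high T: most points are no longer confident
]
print(select_temperature(temps, labels_by_T))
```

Ties between temperatures with equal score are resolved in favour of the lowest one in this sketch; the text above does not specify a tie-breaking rule.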

In comparing our method and graph-cuts, both of
which use the same energy function, it can be observed
that our method always achieves an equal or lower
number of misclassifications than graph-cuts. However,
it appears that in several instances of labelled
points, it is preferable to apply the energy function
of [21].

Also it seems that for q = 2 our method significantly
outperforms the other two methods, mainly due to its
ability to identify the third class type, although none
of its points is labelled. For q = 3 the solutions of
graph-cuts and of our method become similar as the
number of labelled points increases.

6.2. Leukemia gene expression data set 

In this section we present the results of applying our
algorithm to a real-world problem of cancer classification
and class discovery. In cancer research, there is a
particular need for semi-supervised techniques, as the
classes and sub-classes (cancer types) are only partially
known. Hence one needs to apply methods that can
help partition the data into known classes and possibly
identify novel ones.

¹ M denotes the total number of labelled points.

² In order to apply graph-cuts when q = 3 we used the
approximation of [7].

Our example is based on gene expression data³ of acute
leukemia published by Armstrong et al. They analyzed three
different types of acute leukemia: acute myeloid leukemia
(AML), acute lymphoblastic leukemia (ALL) and a
sub-type of ALL which carries a chromosomal translocation
in the MLL gene. Armstrong et al. show (in a
supervised manner) that this sub-type has a distinct
molecular profile and can be considered a new type of
leukemia termed MLL.

We applied our algorithm to the 57 leukemia samples
of Armstrong et al. (20 ALL, 20 AML and 17 MLL samples), each
described by the expression levels of the 200 genes
with largest variance across samples. The similarity
between samples was calculated in a standard manner
in this field⁴. As in Sec. 6.1, we carried out two
sets of experiments. In the first set of experiments we
randomly chose M points (M = 2, 4, 6) from the ALL
and AML samples but not from the MLL class, and
in the second set of experiments M labelled points
(M = 3, 6, 9) were randomly selected from all three
classes. The results appear in Fig. 5 in the same format
as in Fig. 4. The number of misclassified points
in the unsupervised case was 11.
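
The similarity computation described in the footnote to this paragraph (per-gene normalization, Euclidean distance, Gaussian kernel) can be sketched in a few lines. The sketch makes assumptions that the text does not fix: the scale a is taken as the mean distance over all pairs of samples, the population standard deviation is used for the normalization, and the function name `similarity_matrix` is hypothetical. Genes with zero variance would cause a division by zero and are assumed to have been filtered out beforehand.

```python
import math
import statistics


def similarity_matrix(X):
    """X[i][g] is the expression of gene g in sample i; returns J[i][j]."""
    n, m = len(X), len(X[0])
    columns = list(zip(*X))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]  # assumption: population std
    # Normalize each gene: subtract its mean and divide by its standard deviation.
    Z = [[(X[i][g] - means[g]) / stds[g] for g in range(m)] for i in range(n)]
    # Euclidean distance between every pair of samples.
    d = [[math.dist(Z[i], Z[j]) for j in range(n)] for i in range(n)]
    # Assumption: a is the mean distance over all distinct pairs.
    a = statistics.fmean([d[i][j] for i in range(n) for j in range(i + 1, n)])
    return [[math.exp(-(d[i][j] / a) ** 2) for j in range(n)] for i in range(n)]


# Hypothetical data: three samples described by two genes.
X = [[1.0, 2.0], [2.0, 4.0], [3.0, 9.0]]
J = similarity_matrix(X)
for row in J:
    print([round(v, 3) for v in row])
```

The resulting matrix is symmetric with values in (0, 1]; only the entries for neighboring points i and j of the graph G enter the energy as $J_{ij}$.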

As in the previous data set, our method always 
achieves an equal or lower number of misclassifications 
than graph-cuts. Notice that in the q = 2 case, our 
method is able to predict the existence of MLL, while 
all 17 MLL points are misclassified in the other meth- 
ods. It appears that for this data set, applying the 
min-cut cost function is almost always superior to the 
quadratic cost function of [21]. Another interesting
phenomenon is the relatively low number of misclassifications
in the unsupervised case. It happens that in
20% to 40% of the instances (depending on q and M) it
is preferable to apply our method without the labelled 
points. 

6.3. Yeast cell cycle gene expression data 

In this section we describe an application of our
method to a real-life problem in cellular biology for
which the true solution is partly unknown. This concerns
the assignment of the yeast's genes to the stage
in the cell cycle in which they are expressed. While
the yeast's genome is well-characterized, the function
of many of its genes remains to be determined. Therefore,
correctly assigning genes to their cell cycle phase
may shed light on their function and help connect them
to the emerging cellular network.

³ Simultaneous measurements of mRNA levels of thousands
of genes in a single tissue sample.

⁴ The expression level of each gene is 'normalized' by
subtracting its mean expression over all samples and
dividing by its standard deviation. The distance between
samples i and j, $d_{ij}$, is then the Euclidean norm over their
200 genes, and $J_{ij} = \exp(-d_{ij}^2/a^2)$ where $a = \langle d \rangle$.

Figure 5. The same as in Fig. 4 but for 57 leukemia samples. The values of q and
M appear in each panel. The number of misclassified points in the unsupervised
case was 11.

The yeast's cell cycle was studied by various researchers,
typically by applying unsupervised methods,
e.g. jniP- Here we use the data of Spellman et
al., who measured the expression level of the yeast's
genes at 18 specific times over the course of two cell
cycles; thus the data consist of 18 measurements of more
than 6000 genes. Due to experimental difficulties some
of the entries in this 18 x 6000 matrix are missing,
hence following P we used a subset of 4523 genes for
which at least 15 out of the 18 readings are available.
For 77 of these genes, the assignment to one of 5 stages
in the cell cycle (M/G1, G1, S, S/G2 and G2/M) is
well established. Therefore, we have a multi-class
classification problem (q = 5) of 4523 points in 18 dimensions,
with 77 labelled points. As a similarity measure
we used a standard protocol as in Sec. 6.2.

Since ground truth is not available in this problem we 
decided to measure the success rate of our method by 
comparing our results to the proposed classification of 
Spellman et al. They used several biological criteria in 
order to rank the genes according to their participation 
in the cell-cycle. Their list consists of 604 out of the 
4523 genes, and 69 of them also appear in the list of 
known 77 genes, leaving 535 genes as a test set. 

We classified the 535 points to one of the 5 classes, 
or marked them as 'confused' between classes. When 
considering only the classified points and treating the 
'confused' points as errors our average success rate is 
32% (over the 5 classes), while graph-cuts reaches 20%. 

7. Discussion

We introduced an approach to semi-supervised learn- 
ing which is based on statistical physics. Our approach 
may be applied to any energy function, and yields an 
equal or better performance than minimizing the same 
energy function. Our method is most suitable in case 
the number of labelled points is small, since its classifications
would coincide with the minimal energy solution
as the number of labelled points becomes larger. 

The method is based on the Multicanonical MCMC 
method, which allows for an efficient estimation of the 
Boltzmann distribution, even in the multi-class sce- 
nario. The basic difficulty in methods which seek the 
minimal energy, i.e. work at T = 0, is that the multi- 
class scenario is NP-hard. We avoid such difficulties 
since the interesting regime for classification is T > 0. 

The computational complexity of MCMC is hard to es- 
timate, as it is problem dependent. A large multi-class 
data set may indeed be difficult to sample, and require 
a long run, which calls for even more efficient MCMC 
or approximation methods. However, we hope to have 
convinced the reader that our performance gain over 
other, more efficient, methods may be worthwhile. 

Although our results display the advantages of incor- 
porating labelled points in an unsupervised setting, 
the performance highly depends on the specific choice 
of labelled points, and in some cases it is even prefer- 
able to ignore the labelled points. A related phe- 
nomenon already appeared in previous work, e.g. jH|, 
and should be thoroughly addressed. 